Project Problem Statement - Potential Customers Prediction¶


Context¶


The EdTech industry has surged immensely over the past decade. According to one forecast, the online education market will be worth $286.62 billion by 2023, growing at a compound annual growth rate (CAGR) of 10.26% from 2018 to 2023. Dominant features such as easy information sharing, personalized learning experiences, and transparent assessment have made online education increasingly preferable to traditional education and have driven its growth and expansion.

The online education sector has witnessed rapid growth and is attracting many new customers. Due to this rapid growth, many new companies have emerged in the industry. With the availability and ease of use of digital marketing resources, companies can reach a wider audience with their offerings. The customers who show interest in these offerings are termed leads. There are various sources of obtaining leads for EdTech companies, such as:

  • The customer interacts with the marketing front on social media or other online platforms.

  • The customer browses the website/app and downloads the brochure.

  • The customer connects through emails for more information.

The company then nurtures these leads and tries to convert them to paid customers. For this, a representative from the organization connects with the lead on a call or through email to share further details.


Objective¶


ExtraaLearn is an early-stage startup that offers programs on cutting-edge technologies to students and professionals to help them upskill/reskill. With a large number of leads being generated on a regular basis, one of the issues ExtraaLearn faces is identifying which leads are more likely to convert so that it can allocate resources accordingly. You, as a data scientist at ExtraaLearn, have been provided the leads data to:

  • Analyze and build an ML model to help identify which leads are more likely to convert to paid customers.

  • Find the factors driving the lead conversion process.

  • Create a profile of the leads who are likely to convert.


Learning Outcomes¶


EDA (Univariate Analysis, Multivariate Analysis)

Visualization

Data Preprocessing (Log Transformations, Outlier Treatment, Missing Value Treatment, Feature Engineering)

Classification Models (Logistic Regression, Decision Trees, Random Forest)

Model Performance Evaluation and Improvement (Cross-Validation Techniques)


Data Dictionary¶


The data contains the different attributes of leads and their interaction details with ExtraaLearn. The detailed data dictionary is given below.

ID: ID of the lead

age: Age of the lead

current_occupation: Current occupation of the lead. Values include 'Professional', 'Unemployed', and 'Student'

first_interaction: How did the lead first interact with ExtraaLearn? Values include 'Website' and 'Mobile App'

profile_completed: What percentage of the profile has been filled by the lead on the website/mobile app? Values include Low (0-50%), Medium (50-75%), and High (75-100%)

website_visits: The number of times a lead has visited the website

time_spent_on_website: Total time spent on the website

page_views_per_visit: Average number of pages on the website viewed during the visits

last_activity: Last interaction between the lead and ExtraaLearn

Email Activity: Seeking details about the program through email, a representative sharing information with the lead such as a program brochure, etc.

Phone Activity: Had a phone conversation with a representative, had a conversation over SMS with a representative, etc.

Website Activity: Interacted on live chat with a representative, updated profile on the website, etc.

print_media_type1: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Newspaper

print_media_type2: Flag indicating whether the lead had seen the ad of ExtraaLearn in the Magazine

digital_media: Flag indicating whether the lead had seen the ad of ExtraaLearn on the digital platforms

educational_channels: Flag indicating whether the lead had heard about ExtraaLearn in the education channels like online forums, discussion threads, educational websites, etc.

referral: Flag indicating whether the lead had heard about ExtraaLearn through reference.

status: Flag indicating whether the lead was converted to a paid customer or not.

Importing the necessary libraries and overview of the dataset¶

In [44]:
# Import warnings
import warnings
warnings.filterwarnings("ignore")

# Libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Algorithms to use
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, classification_report, f1_score
from sklearn import metrics
from xgboost import XGBRegressor, XGBClassifier
import multiprocessing
import shap
import xgboost as xgb

Loading the data¶

In [4]:
customer = pd.read_csv("ExtraaLearn.csv")
In [5]:
# Copying data to another variable to avoid any changes to original data
data = customer.copy()

View the first and the last 5 rows of the dataset¶

In [6]:
data.head()
Out[6]:
ID age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status
0 EXT001 57 Unemployed Website High 7 1639 1.861 Website Activity Yes No Yes No No 1
1 EXT002 56 Professional Mobile App Medium 2 83 0.320 Website Activity No No No Yes No 0
2 EXT003 52 Professional Website Medium 3 330 0.074 Website Activity No No Yes No No 0
3 EXT004 53 Unemployed Website High 4 464 2.057 Website Activity No No No No No 1
4 EXT005 23 Student Website High 4 600 16.914 Email Activity No No No No No 0
In [45]:
data.tail()
Out[45]:
age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status engagement_score interaction_ratio
4607 35 Unemployed Mobile App Medium 2.772589 5.888878 1.153732 Phone Activity No No No Yes No 0 6.794185 1.287342
4608 55 Professional Mobile App Medium 2.197225 7.752765 1.855204 Email Activity No No No No No 0 14.382958 0.769551
4609 58 Professional Website High 1.098612 5.361292 1.306168 Email Activity No No No No No 1 7.002750 0.476380
4610 57 Professional Mobile App Medium 0.693147 5.043425 1.584940 Website Activity Yes No No No No 0 7.993528 0.268148
4611 55 Professional Website Medium 1.609438 7.736744 1.123305 Phone Activity No No No No No 0 8.690722 0.757987
In [7]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4612 entries, 0 to 4611
Data columns (total 15 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   ID                     4612 non-null   object 
 1   age                    4612 non-null   int64  
 2   current_occupation     4612 non-null   object 
 3   first_interaction      4612 non-null   object 
 4   profile_completed      4612 non-null   object 
 5   website_visits         4612 non-null   int64  
 6   time_spent_on_website  4612 non-null   int64  
 7   page_views_per_visit   4612 non-null   float64
 8   last_activity          4612 non-null   object 
 9   print_media_type1      4612 non-null   object 
 10  print_media_type2      4612 non-null   object 
 11  digital_media          4612 non-null   object 
 12  educational_channels   4612 non-null   object 
 13  referral               4612 non-null   object 
 14  status                 4612 non-null   int64  
dtypes: float64(1), int64(4), object(10)
memory usage: 540.6+ KB
  • The dataset has 4,612 rows and 15 columns.

  • age, website_visits, time_spent_on_website, page_views_per_visit, and status are numeric, while the rest of the columns are of the object type.

  • There are no null values in the dataset.

  • ID is an identifier. Let's check if each entry of the column is unique.
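
The uniqueness check mentioned above isn't shown as a separate cell; a minimal sketch of it, demonstrated here on a toy frame standing in for the leads data (hypothetical values), could look like:

```python
import pandas as pd

# Toy frame standing in for the leads data (hypothetical values)
df = pd.DataFrame({"ID": ["EXT001", "EXT002", "EXT003"], "age": [57, 56, 52]})

# The ID column is unique when the number of distinct values equals the row count
is_unique = df["ID"].nunique() == len(df)
print(is_unique)  # True for this toy frame
```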

Observations:

  • We can see that all the entries of this column are unique. Hence, this column would not add any value to our analysis.
  • Let's drop this column.
In [8]:
data = data.drop(["ID"], axis = 1)
In [9]:
data.head()
Out[9]:
age current_occupation first_interaction profile_completed website_visits time_spent_on_website page_views_per_visit last_activity print_media_type1 print_media_type2 digital_media educational_channels referral status
0 57 Unemployed Website High 7 1639 1.861 Website Activity Yes No Yes No No 1
1 56 Professional Mobile App Medium 2 83 0.320 Website Activity No No No Yes No 0
2 52 Professional Website Medium 3 330 0.074 Website Activity No No Yes No No 0
3 53 Unemployed Website High 4 464 2.057 Website Activity No No No No No 1
4 23 Student Website High 4 600 16.914 Email Activity No No No No No 0

Exploratory Data Analysis and Data Preprocessing¶

Summary Statistics for numerical columns¶

In [10]:
# Selecting numerical columns and checking the summary statistics
num_cols = data.select_dtypes('number').columns

data[num_cols].describe().T
Out[10]:
count mean std min 25% 50% 75% max
age 4612.0 46.201214 13.161454 18.0 36.00000 51.000 57.00000 63.000
website_visits 4612.0 3.566782 2.829134 0.0 2.00000 3.000 5.00000 30.000
time_spent_on_website 4612.0 724.011275 743.828683 0.0 148.75000 376.000 1336.75000 2537.000
page_views_per_visit 4612.0 3.026126 1.968125 0.0 2.07775 2.792 3.75625 18.434
status 4612.0 0.298569 0.457680 0.0 0.00000 0.000 1.00000 1.000

Observations:

Age:

  • Mean: 46.2 years; Median: 51 years → Slight left skew driven by younger ages.

  • Most ages fall between 36 (25th percentile) and 57 (75th percentile).

  • Range: 18 (youngest) to 63 (oldest).

Website Visits:

  • Mean: 3.57 visits; Median: 3 visits → Right skew observed.

  • Range: 0 to 30 visits, with most users visiting 2–5 times (IQR).

  • Subset of users showed high engagement (maximum of 30 visits).

Time Spent on Website:

  • Mean: 724 seconds (~12 minutes); Median: 376 seconds (~6 minutes) → Right skew present.

  • Range: 0 to over 2500 seconds (~40+ minutes), with significant variation in engagement.

  • Most users spent 148.75 (25th percentile) to 1336.75 (75th percentile) seconds.

Page Views Per Visit:

  • Mean: 3.03 pages; Median: 2.79 pages → Right skew detected.

  • Range: 0 to 18.43 pages, with the majority viewing 2–4 pages (IQR).

Status:

  • The binary target variable indicates that about 30% of leads converted (status = 1).

Checking the distribution and outliers for numerical columns in the data¶

In [11]:
for col in ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']:
    print(col)

    print('Skew :', round(data[col].skew(), 2))

    plt.figure(figsize = (15, 4))

    plt.subplot(1,2,1)

    data[col].hist(bins = 10, grid = False)

    plt.ylabel('count')

    plt.subplot(1, 2, 2)

    sns.boxplot(x = data[col])

    plt.show()
age
Skew : -0.72
[histogram and boxplot for age]
website_visits
Skew : 2.16
[histogram and boxplot for website_visits]
time_spent_on_website
Skew : 0.95
[histogram and boxplot for time_spent_on_website]
page_views_per_visit
Skew : 1.27
[histogram and boxplot for page_views_per_visit]

Observations:

  • Age (-0.72 skew): Most ages are in the higher range; consistent, few outliers.

  • Website Visits (2.16 skew): Most visits are low, but a few users visit a lot; many outliers.

  • Time Spent (0.95 skew): Most spend little time, with a moderate tail toward longer times.

  • Page Views per Visit (1.27 skew): Most view few pages, but some view significantly more; many outliers.

Check the percentage of each category for categorical variables.

In [12]:
#Check categorical variables.  Status is numerical but can also be categorical
cat_cols = ['current_occupation', 'first_interaction', 'profile_completed', 'last_activity', 'print_media_type1',
        'print_media_type2', 'digital_media', 'educational_channels', 'referral','status']

for col in cat_cols:
    print(data[col].value_counts(normalize = True))  # The parameter normalize = True gives the percentage of each category
    print('*'*40)
current_occupation
Professional    0.567216
Unemployed      0.312446
Student         0.120338
Name: proportion, dtype: float64
****************************************
first_interaction
Website       0.551171
Mobile App    0.448829
Name: proportion, dtype: float64
****************************************
profile_completed
High      0.490893
Medium    0.485906
Low       0.023200
Name: proportion, dtype: float64
****************************************
last_activity
Email Activity      0.493929
Phone Activity      0.267563
Website Activity    0.238508
Name: proportion, dtype: float64
****************************************
print_media_type1
No     0.892238
Yes    0.107762
Name: proportion, dtype: float64
****************************************
print_media_type2
No     0.94948
Yes    0.05052
Name: proportion, dtype: float64
****************************************
digital_media
No     0.885733
Yes    0.114267
Name: proportion, dtype: float64
****************************************
educational_channels
No     0.847138
Yes    0.152862
Name: proportion, dtype: float64
****************************************
referral
No     0.979835
Yes    0.020165
Name: proportion, dtype: float64
****************************************
status
0    0.701431
1    0.298569
Name: proportion, dtype: float64
****************************************

Observations:

  • Current Occupation: Most are professionals (56.7%), with smaller proportions of unemployed (31.2%) and students (12.0%).

  • First Interaction: Slightly more users interacted via the website (55.1%) than the mobile app (44.9%).

  • Profile Completed: High and medium completion levels are nearly equal (49.1% and 48.6%), with very few low completions (2.3%).

  • Last Activity: Email activity dominates (49.4%), followed by phone activity (26.8%) and website activity (23.9%).

  • Print Media: Both types show low engagement, with "No" responses at 89.2% and 94.9% respectively.

  • Digital Media: Most users do not engage (88.6%), while 11.4% do.

  • Educational Channels: Minimal use, with "Yes" at 15.3%.

  • Referral: Rare, with only 2.0% referred.

  • Status: 29.9% converted

Bivariate analysis.

In [13]:
# List of columns to plot
columns_to_plot = ['current_occupation', 'first_interaction', 'profile_completed',
                   'last_activity', 'print_media_type1', 'print_media_type2',
                   'digital_media', 'educational_channels', 'referral']

# Loop through each column
for col in columns_to_plot:

    # Create smaller count plots
    plt.figure(figsize=(6,3))  # Reduced size
    sns.countplot(x=col, hue='status', data=data)
    plt.title(f"Count Plot for {col} by Status", fontsize=12)
    plt.xlabel(col.capitalize(), fontsize=10)
    plt.ylabel("Count", fontsize=10)
    plt.legend(title='Status', fontsize=8)  # Adjust legend font size
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=8)
    plt.show()

    # Create stacked bar plots with percentages
    # (an `if col != 'Attrition'` guard removed here: 'Attrition' is a column
    # from a different dataset, so the condition was always true)
    crosstab = pd.crosstab(data[col], data['status'], normalize='index')*100
    ax = crosstab.plot(kind='bar', figsize=(8, 4), stacked=True)
    plt.ylabel('Percentage Status %')
    plt.title(f"Stacked Bar Plot for {col} by Status", fontsize=12)
    plt.xlabel(col.capitalize(), fontsize=10)
    plt.xticks(fontsize=8)
    plt.yticks(fontsize=8)
    plt.legend(title='Status', fontsize=8)

    # Annotate bars with percentages
    for p in ax.patches:
        width, height = p.get_width(), p.get_height()
        x, y = p.get_xy()
        ax.annotate(f'{height:.1f}%', (x + width/2, y + height/2), ha='center', va='center', fontsize=8, color='black')

    plt.show()
[count plot and stacked bar plot for each categorical variable by status]

Observations:

  • 35% of professionals, 26.6% of unemployed individuals, and 11.7% of students converted to paid customers. Professionals had the largest volume of data available.

  • For first interactions, 45.6% of users converted via the website, compared to 10.5% via the mobile app.

  • Regarding profile completion, 41.8% of users with completed profiles converted, compared to 18.9% for medium completion and 7.5% for low completion.

  • In terms of last activity, the website had the highest conversion rate at 38.5%, followed by e-mail activity at 30.3% and phone activity at 21.3%. E-mail activity had the most extensive data available.

  • Referrals drove a conversion rate of 67.7%, significantly higher than the 29.1% conversion rate for non-referrals.

In [14]:
# List of numerical columns to loop through
columns_to_plot = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']

# Function to create both boxplots and a pair plot
def create_visualizations(data, columns, x_col):
    # Loop to create boxplots for each numerical column
    for col in columns:
        plt.figure(figsize=(10, 6))

        # Create the boxplot
        sns.boxplot(data=data, x=x_col, y=col)

        # Add labels and title
        plt.title(f'Boxplot of {col} by {x_col}')
        plt.xlabel(x_col.capitalize())
        plt.ylabel(col.capitalize())

        # Show plot
        plt.show()

    # Create a pair plot for all numerical columns with the target variable as hue
    sns.pairplot(data, hue=x_col, vars=columns, height=2.5)
    plt.suptitle("Pair Plots for Numerical Features by Status", y=1.02, fontsize=14)  # Adding a title
    plt.show()

# Call the function
create_visualizations(data, columns_to_plot, x_col="status")
[boxplots of each numerical feature by status and a pair plot]

Observations:

  • Age and the amount of time spent on the website positively influence conversion rates.

  • For users who converted, the time spent on the website shows significant variability.

  • Among users who did not convert, the time spent on the website exhibits a right-skewed distribution with noticeable outliers.

  • Both page views per visit and the number of website visits demonstrate the presence of outliers and right-skewed distributions.

Apply log transformation to reduce skewness and add additional features:¶

Based on the above and re-examining the data, I will adjust for skewness:

Time Spent on Website:

  • Median (376) is much smaller than the mean (724), with a wide range (0 to 2537). This is likely highly right-skewed.

  • Log transformation is strongly recommended.

Website Visits:

  • Mean is higher than the median (3.566 vs. 3), and the max value (30) is quite far from the 75th percentile (5), indicating right skewness.

  • Log transformation would be helpful.

Page Views per Visit:

  • Mean (3.026) is slightly higher than the median (2.792), and the max (18.434) is far from the 75th percentile (3.75625). This indicates moderate right skewness.

  • Log transformation could improve the distribution.
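
As a quick sanity check of this reasoning, a hedged demo on synthetic right-skewed data (exponential draws loosely mimicking time_spent_on_website, not the actual dataset) shows log1p pulling in the long tail:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed sample; scale chosen to loosely mimic time_spent_on_website
x = pd.Series(rng.exponential(scale=700.0, size=4000))

# log1p (log(1 + x)) compresses the long right tail, reducing the skew magnitude
print(round(x.skew(), 2), round(np.log1p(x).skew(), 2))
```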

Include Features for Classification (Exclude in Regression)¶

  • Certain fields are interconnected, so I've combined time_spent_on_website and page_views_per_visit into an engagement score to summarize user activity.
  • Additionally, I've introduced an interaction ratio to evaluate the intensity of user interactions on the website.
In [15]:
# Log Transformation to address skewness above
for col in ['website_visits', 'time_spent_on_website', 'page_views_per_visit']:
    data[col] = np.log1p(data[col])  # log(1+x) to avoid log(0)

# Creating new features
data["engagement_score"] = data["time_spent_on_website"] * data["page_views_per_visit"]
data["interaction_ratio"] = data["website_visits"] / (data["page_views_per_visit"] + 1)  # Prevent division by zero

# Review data after transformation:
# Extend the list of numerical columns to include the new features
columns_to_plot = [
    'age',
    'website_visits',
    'time_spent_on_website',
    'page_views_per_visit',
    'engagement_score',
    'interaction_ratio'
]

# Function to create both boxplots and a pair plot to see if skewness is reduced
def create_visualizations(data, columns, x_col):
    # Loop to create boxplots for each numerical column
    for col in columns:
        plt.figure(figsize=(10, 6))

        # Create the boxplot
        sns.boxplot(data=data, x=x_col, y=col)

        # Add labels and title
        plt.title(f'Boxplot of {col} by {x_col}')
        plt.xlabel(x_col.capitalize())
        plt.ylabel(col.capitalize())

        # Show plot
        plt.show()

    # Create a pair plot for all numerical columns with the target variable as hue
    sns.pairplot(data, hue=x_col, vars=columns, height=2.5)
    plt.suptitle("Pair Plots for Numerical Features by Status", y=1.02, fontsize=14)  # Adding a title
    plt.show()

# Call the function
create_visualizations(data, columns_to_plot, x_col="status")
[boxplots of the transformed numerical features by status and a pair plot]

Observations:

  • Website Visits and Time Spent on Website: The log transformations have reduced skewness, resulting in smoother distributions for both status 0 and status 1. Converted users (status 1) still appear more prominent at higher values, emphasizing their engagement.

  • Conversion rates (status 1) still appear strongly associated with higher values of engagement-related metrics (time_spent_on_website and page_views_per_visit).

  • Page Views per Visit: After the log transformation, the distributions highlight subtle differences between the groups, with status 1 users exhibiting slightly longer tails at higher values.

Pairwise correlations between all the variables.

In [16]:
plt.figure(figsize=(10, 7))

# Select only the numeric columns
datanumbers = data.select_dtypes(include='number')

# Plot the heatmap
sns.heatmap(datanumbers.corr(), annot=True, fmt=".2f")

plt.show()
[correlation heatmap of the numeric columns]

Observations:

  • The heatmap highlights the positive correlation between status and age (0.12) as well as between status and time spent on the website (0.25).

  • The newly created engagement score also shows positive correlation with status.

  • Among the variables analyzed, time spent on the website exhibited the strongest correlation, reinforcing its significance in driving outcomes.
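
To read these relationships off numerically rather than from the heatmap, one could sort the correlations with status; a sketch on a toy numeric frame (column values are illustrative, not the real data):

```python
import pandas as pd

# Toy numeric frame; values are illustrative stand-ins for the notebook's columns
df = pd.DataFrame({
    "age": [57, 56, 52, 23, 53],
    "time_spent_on_website": [1639, 83, 330, 600, 464],
    "status": [1, 0, 0, 0, 1],
})

# Correlation of every numeric column with the target, strongest first
corr_with_status = df.corr()["status"].drop("status").sort_values(ascending=False)
print(corr_with_status)
```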

Preparing the data for modeling¶

In [17]:
# Separating the target variable and other variables
X = data.drop(columns = 'status')
Y = data['status']

# Creating dummy variables
X = pd.get_dummies(X, drop_first = True)

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 1)

Create a logistic regression model with a scaled dataset as a baseline. Later, we'll compare it against tree-based classification models such as Decision Trees and Random Forest, along with XGBoost.¶

In [18]:
# This excludes the new features engagement score and interaction ratio and uses original features for cleaner interpretation and to avoid multicollinearity
# Function to evaluate metrics and plot confusion matrix
def metrics_score(actual, predicted, title="Confusion Matrix"):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Converter', 'Converter'], yticklabels=['Non-Converter', 'Converter'])
    plt.title(title)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

# 1. Define features and target
X = data[['age', 'current_occupation', 'first_interaction', 'profile_completed',
          'website_visits', 'time_spent_on_website', 'page_views_per_visit',
          'last_activity', 'print_media_type1', 'print_media_type2',
          'digital_media', 'educational_channels', 'referral']]
y = data['status']

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Identify numerical and categorical columns
numerical_columns = ['age', 'website_visits', 'time_spent_on_website', 'page_views_per_visit']
categorical_columns = ['current_occupation', 'first_interaction', 'profile_completed',
                       'last_activity', 'print_media_type1', 'print_media_type2',
                       'digital_media', 'educational_channels', 'referral']

# 4. Preprocessor: Scale numerical columns and encode categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_columns),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ]
)

# 5. Create a pipeline with preprocessing and Logistic Regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', LogisticRegression())
])

# 6. Train the model using the pipeline
pipeline.fit(X_train, y_train)

# 7. Make predictions on the test set
y_pred = pipeline.predict(X_test)

# 8. Make predictions on the training set and evaluate
y_pred_train = pipeline.predict(X_train)
print("\n--- Training Set Performance ---")
metrics_score(y_train, y_pred_train, title="Confusion Matrix (Training Set)")

# 9. Evaluate performance on the test set
print("\n--- Test Set Performance ---")
metrics_score(y_test, y_pred)


# 10. Extract Coefficients and Feature Importance
# Get feature names (numerical + one-hot encoded)
feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_columns)
all_feature_names = numerical_columns + list(feature_names)

# 10a. Check for Multicollinearity using VIF
# (variance_inflation_factor and pandas were already imported above)

# Preprocessed data for VIF calculation
X_preprocessed = pipeline.named_steps['preprocessor'].transform(X_train)
vif_data = pd.DataFrame()
vif_data["Feature"] = all_feature_names
vif_data["VIF"] = [variance_inflation_factor(X_preprocessed, i) for i in range(X_preprocessed.shape[1])]

print("\n--- Variance Inflation Factor (VIF) ---")
print(vif_data)

# Extract coefficients from the logistic regression model
coefficients = pipeline.named_steps['model'].coef_[0]

# Combine feature names, coefficients, and odds ratios
feature_importance = pd.DataFrame({
    'Feature': all_feature_names,
    'Coefficient': coefficients,
    'Odds Ratio': np.exp(coefficients)  # Convert coefficients to odds ratios
}).sort_values(by='Coefficient', ascending=False)

print("\n--- Feature Importance ---")
print(feature_importance)

# 11. Plot Confusion Matrix for the Test Set
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Converter', 'Converter'], yticklabels=['Non-Converter', 'Converter'])
plt.title("Confusion Matrix (Test Set)")
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# 12. Plot the ROC Curve and AUC
y_prob = pipeline.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f"AUC: {roc_auc_score(y_test, y_prob):.2f}")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
--- Training Set Performance ---
              precision    recall  f1-score   support

           0       0.86      0.92      0.89      2586
           1       0.77      0.65      0.71      1103

    accuracy                           0.84      3689
   macro avg       0.82      0.78      0.80      3689
weighted avg       0.83      0.84      0.83      3689

[confusion matrix for the training set]
--- Test Set Performance ---
              precision    recall  f1-score   support

           0       0.86      0.92      0.89       649
           1       0.77      0.63      0.69       274

    accuracy                           0.83       923
   macro avg       0.81      0.78      0.79       923
weighted avg       0.83      0.83      0.83       923

[confusion matrix for the test set]
/usr/local/lib/python3.11/dist-packages/statsmodels/stats/outliers_influence.py:197: RuntimeWarning: divide by zero encountered in scalar divide
  vif = 1. / (1. - r_squared_i)
--- Variance Inflation Factor (VIF) ---
                            Feature       VIF
0                               age  2.001388
1                    website_visits  1.150939
2             time_spent_on_website  1.217417
3              page_views_per_visit  1.131902
4   current_occupation_Professional       inf
5        current_occupation_Student       inf
6     current_occupation_Unemployed       inf
7      first_interaction_Mobile App       inf
8         first_interaction_Website       inf
9            profile_completed_High       inf
10            profile_completed_Low       inf
11         profile_completed_Medium       inf
12     last_activity_Email Activity       inf
13     last_activity_Phone Activity       inf
14   last_activity_Website Activity       inf
15             print_media_type1_No       inf
16            print_media_type1_Yes       inf
17             print_media_type2_No       inf
18            print_media_type2_Yes       inf
19                 digital_media_No       inf
20                digital_media_Yes       inf
21          educational_channels_No       inf
22         educational_channels_Yes       inf
23                      referral_No       inf
24                     referral_Yes       inf

--- Feature Importance ---
                            Feature  Coefficient  Odds Ratio
9            profile_completed_High     1.274829    3.578090
8         first_interaction_Website     1.127395    3.087602
2             time_spent_on_website     1.022725    2.780761
24                     referral_Yes     0.686616    1.986981
4   current_occupation_Professional     0.652937    1.921174
14   last_activity_Website Activity     0.411688    1.509364
0                               age     0.120159    1.127676
6     current_occupation_Unemployed     0.060079    1.061921
18            print_media_type2_Yes     0.037712    1.038432
12     last_activity_Email Activity     0.012840    1.012923
16            print_media_type1_Yes    -0.036738    0.963928
20                digital_media_Yes    -0.060821    0.940991
21          educational_channels_No    -0.127003    0.880731
22         educational_channels_Yes    -0.141959    0.867657
1                    website_visits    -0.149409    0.861217
3              page_views_per_visit    -0.150737    0.860074
19                 digital_media_No    -0.208141    0.812092
15             print_media_type1_No    -0.232224    0.792768
17             print_media_type2_No    -0.306674    0.735890
11         profile_completed_Medium    -0.307595    0.735213
13     last_activity_Phone Activity    -0.693491    0.499828
23                      referral_No    -0.955579    0.384589
5        current_occupation_Student    -0.981978    0.374569
10            profile_completed_Low    -1.236197    0.290487
7      first_interaction_Mobile App    -1.396357    0.247497

Observations:

Performance:

  • Training accuracy is 83%, and test accuracy is 84%, suggesting the model generalizes reasonably well without significant overfitting.

  • Class 1 recall (65% on training, 63% on test) is much lower than Class 0, indicating the model struggles to correctly identify positive cases (conversions).

Feature Importance:

  • Top Positive Predictors: profile_completed_High (OR = 3.58): leads with a fully completed profile have about 3.58 times the odds of converting relative to the reference group. The other top predictors, first_interaction_Website (OR = 3.09) and time_spent_on_website (OR = 2.78), highlight the importance of user engagement.

  • Top Negative Predictors: first_interaction_Mobile App (OR = 0.25) and profile_completed_Low (OR = 0.29) are most indicative of users unlikely to convert.
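
The odds ratios in the table are derived directly from the coefficients; a quick arithmetic sanity check:

```python
# An odds ratio is exp(coefficient), so profile_completed_High's coefficient
# of ~1.274829 (from the feature importance table above) maps to ~3.578.
import math

coef_profile_high = 1.274829
odds_ratio = math.exp(coef_profile_high)
print(round(odds_ratio, 3))  # 3.578
```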

Recommendations:

  • Handle Multicollinearity: Although multicollinearity has been addressed using drop_first=True, many features still show infinite VIF values, indicating that the issue persists. Decision Trees and Random Forests are better suited for managing multicollinearity without requiring extensive feature engineering. While Lasso Regression or PCA could be applied, the goal is to use regression as a baseline for comparison. As a result, the focus will shift to classification models instead.
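
To illustrate the dummy-variable collinearity behind the infinite VIFs, here is a minimal numpy-only sketch of the VIF formula, 1 / (1 - R²), on hypothetical toy data: keeping every dummy level yields infinite VIFs, while dropping one level per category keeps them finite.

```python
import numpy as np
import pandas as pd

def vif_scores(X):
    """VIF_i = 1 / (1 - R^2_i), regressing column i on the remaining columns."""
    vifs = []
    for i in range(X.shape[1]):
        y = X[:, i]
        A = np.column_stack([np.delete(X, i, axis=1), np.ones(len(y))])  # others + intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        vifs.append(np.inf if 1.0 - r2 <= 1e-12 else 1.0 / (1.0 - r2))
    return vifs

# Hypothetical toy data with one numeric and one categorical column
df = pd.DataFrame({"age": [25.0, 32, 41, 29, 35, 48],
                   "occupation": ["Student", "Professional", "Unemployed",
                                  "Student", "Professional", "Professional"]})
vif_full = vif_scores(pd.get_dummies(df).astype(float).values)                  # all levels kept
vif_dropped = vif_scores(pd.get_dummies(df, drop_first=True).astype(float).values)
print(max(vif_full), max(vif_dropped))  # inf vs. a finite value
```

The dummies for one category sum to 1, so any one of them is an exact linear combination of the others plus the intercept, which drives R² to 1 and the VIF to infinity.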

Building Classification Models¶

Decision Tree¶

In [19]:
# Define a function to plot the confusion matrix and classification report
def metrics_score(actual, predicted):
    print(classification_report(actual, predicted))
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(8, 5))
    sns.heatmap(cm, annot=True, fmt="d", xticklabels=["Not Converted", "Converted"], yticklabels=["Not Converted", "Converted"])
    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    plt.title("Confusion Matrix")
    plt.show()

# Add the additional features
X = data[['age', 'current_occupation', 'first_interaction', 'profile_completed',
          'website_visits', 'time_spent_on_website', 'page_views_per_visit',
          'engagement_score', 'interaction_ratio',
          'last_activity', 'print_media_type1', 'print_media_type2',
          'digital_media', 'educational_channels', 'referral']]

# Convert categorical variables to dummy variables (keep all levels; tree-based models are unaffected by dummy collinearity)
X = pd.get_dummies(X, drop_first=False)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
In [20]:
# Fitting the decision tree classifier on the training data
d_tree =  DecisionTreeClassifier(random_state = 7)

d_tree.fit(X_train, y_train)
Out[20]:
DecisionTreeClassifier(random_state=7)
In [21]:
# Checking performance on the training data
y_pred_train1 = d_tree.predict(X_train)

metrics_score(y_train, y_pred_train1)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2273
           1       1.00      1.00      1.00       955

    accuracy                           1.00      3228
   macro avg       1.00      1.00      1.00      3228
weighted avg       1.00      1.00      1.00      3228

[Confusion matrix heatmap]
In [22]:
# Checking performance on the testing data
y_pred_test1 = d_tree.predict(X_test)

metrics_score(y_test, y_pred_test1)
              precision    recall  f1-score   support

           0       0.85      0.86      0.86       962
           1       0.68      0.67      0.67       422

    accuracy                           0.80      1384
   macro avg       0.76      0.76      0.76      1384
weighted avg       0.80      0.80      0.80      1384

[Confusion matrix heatmap]

Observations:

  • The Decision Tree model is overfitting the training data, as expected, which reduces its ability to generalize well on the test set. Overfitting explains the perfect scores (e.g., precision, recall, F1-score) on the training set and the performance drop on the test set.

Test set performance shows:

  • Overall accuracy is 80%, which is solid but leaves room for improvement.

  • It’s worth noting that the dataset imbalance (majority Class 0 vs. minority Class 1) limits the model’s ability to consistently predict the minority class, as seen in the lower metrics for Class 1 in the test classification report.
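
The imbalance can be read directly off the support counts in the reports (2273 non-converted vs. 955 converted in training):

```python
# Class share computed from the training support counts shown above
n0, n1 = 2273, 955
share_converted = n1 / (n0 + n1)
print(round(share_converted, 2))  # 0.3 — roughly a 70/30 split
```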

Decision Tree - Hyperparameter Tuning¶

  • Class Imbalance Handling: Setting class_weight = 'balanced' addresses class imbalance by giving more weight to the minority class (Class 1), which is critical for imbalanced datasets.

  • Hyperparameter Tuning: Using GridSearchCV to tune parameters like max_depth, criterion, and min_samples_leaf ensures the model is optimized for the best performance.

  • Custom Scorer: Defining a custom scorer with f1_score for Class 1 focuses tuning on a metric that balances precision and recall for the minority class. Since ExtraaLearn is a startup, this balances limited resources (sales team capacity) against maximizing conversions.

  • Cross-Validation (cv = 5): Using cross-validation increases the robustness of hyperparameter tuning and guards against overfitting.
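
For reference, class_weight = 'balanced' assigns each class the weight n_samples / (n_classes * n_class); with the training counts above this up-weights Class 1 by roughly 1.7x:

```python
# class_weight='balanced' formula: n_samples / (n_classes * count(class))
n0, n1 = 2273, 955          # training support from the reports above
n = n0 + n1
w0, w1 = n / (2 * n0), n / (2 * n1)
print(round(w0, 2), round(w1, 2))  # the minority class gets the larger weight
```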

In [23]:
d_tree_tuned = DecisionTreeClassifier(random_state = 7, class_weight = 'balanced')

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 10),
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [5, 10, 20, 25]
             }

# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(d_tree_tuned, parameters, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
d_tree_tuned = grid_obj.best_estimator_

# Fit the best algorithm to the data
d_tree_tuned.fit(X_train, y_train)
Out[23]:
DecisionTreeClassifier(class_weight='balanced', max_depth=np.int64(5),
                       min_samples_leaf=20, random_state=7)
In [24]:
# Checking performance on the training data
y_pred_train2 = d_tree_tuned.predict(X_train)

metrics_score(y_train, y_pred_train2)
              precision    recall  f1-score   support

           0       0.94      0.84      0.89      2273
           1       0.70      0.86      0.77       955

    accuracy                           0.85      3228
   macro avg       0.82      0.85      0.83      3228
weighted avg       0.87      0.85      0.85      3228

[Confusion matrix heatmap]

Observations:

  • Performance on the training data has decreased, which is expected since we are deliberately constraining the tree to avoid overfitting.
  • The model still identifies likely conversions well (Class 1 recall of 86% on the training data).
In [25]:
# Checking performance on the testing data
y_pred_test2 = d_tree_tuned.predict(X_test)

metrics_score(y_test, y_pred_test2)
              precision    recall  f1-score   support

           0       0.92      0.85      0.89       962
           1       0.71      0.84      0.77       422

    accuracy                           0.85      1384
   macro avg       0.82      0.85      0.83      1384
weighted avg       0.86      0.85      0.85      1384

[Confusion matrix heatmap]

Observations:

  • Model Generalization: Consistent performance between training and test sets suggests that hyperparameter tuning reduced overfitting effectively.
  • The lack of a significant gap between the training and test results (e.g., F1 scores, precision, recall) confirms that overfitting is effectively controlled. This reinforces the stability and reliability of the model's performance.
  • The Decision Tree shows better handling of the minority class (1), as seen in its higher recall and balanced F1-score.
  • The Decision Tree has higher accuracy compared to logistic regression.
  • The inclusion of additional features did not significantly affect the results, suggesting that the underlying features already capture the most critical predictive information.

We can reduce the depth to 3 and visualize it

In [26]:
tree_model = DecisionTreeClassifier(class_weight = 'balanced', max_depth = 3,
                       min_samples_leaf = 5, random_state = 7)

# Fit the best algorithm to the data
tree_model.fit(X_train, y_train)
Out[26]:
DecisionTreeClassifier(class_weight='balanced', max_depth=3, min_samples_leaf=5,
                       random_state=7)
In [27]:
features = list(X.columns)

plt.figure(figsize = (20, 20))

tree.plot_tree(tree_model, feature_names = features, filled = True, fontsize = 9, node_ids = True, class_names = None)

plt.show()
[Decision tree visualization]

Observations:

  • The tree starts with the feature first_interaction_Mobile App at the root, splitting on whether the value is <= 0.5. The left branch captures samples where the first interaction was NOT through the mobile app.

  • This indicates that first_interaction_Mobile App is the most significant feature in determining outcomes, as it’s the first decision point.

  • The second most important split is on time_spent_on_website.

  • Interestingly, users whose first interaction is via the mobile app appear less likely to convert. This could point to potential issues with the mobile app experience, making it less engaging or effective compared to the website. This aligns with insights from the logistic regression model, reinforcing the conclusion.
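
The root split can also be read programmatically from a fitted tree. A sketch on hypothetical toy data (tree_.feature and tree_.threshold are standard sklearn attributes):

```python
# Sketch: inspecting the root split of a fitted sklearn decision tree.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(7)
X_toy = rng.random((200, 3))
y_toy = (X_toy[:, 1] > 0.5).astype(int)   # column 1 fully determines the label

clf = DecisionTreeClassifier(max_depth=3, random_state=7).fit(X_toy, y_toy)
print("root feature:", clf.tree_.feature[0])       # index of the root split feature
print("root threshold:", clf.tree_.threshold[0])   # split value at the root
```

Because column 1 determines the label, the fitted tree splits on it first, near the 0.5 cutoff.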

In [28]:
# Importance of features in the tree building

print (pd.DataFrame(d_tree_tuned.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                                      Imp
first_interaction_Mobile App     0.319966
time_spent_on_website            0.259234
profile_completed_High           0.245264
current_occupation_Professional  0.052392
current_occupation_Student       0.034057
last_activity_Phone Activity     0.028137
last_activity_Website Activity   0.025097
age                              0.014836
profile_completed_Medium         0.008904
engagement_score                 0.008230
website_visits                   0.003880
first_interaction_Website        0.000000
current_occupation_Unemployed    0.000000
page_views_per_visit             0.000000
interaction_ratio                0.000000
last_activity_Email Activity     0.000000
profile_completed_Low            0.000000
print_media_type1_No             0.000000
print_media_type1_Yes            0.000000
print_media_type2_No             0.000000
print_media_type2_Yes            0.000000
digital_media_No                 0.000000
digital_media_Yes                0.000000
educational_channels_No          0.000000
educational_channels_Yes         0.000000
referral_No                      0.000000
referral_Yes                     0.000000
In [29]:
# Plotting the feature importance
importances = d_tree_tuned.feature_importances_

indices = np.argsort(importances)

plt.figure(figsize = (10, 10))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')

plt.yticks(range(len(indices)), [features[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()
[Feature importance bar chart]

Observations:

  • The most important features are first_interaction_Mobile App, time_spent_on_website, and profile_completed_High.

  • The decision tree suggests that students are less likely to convert compared to professionals. This is evident in splits where the feature current_occupation_Student directs samples toward branches with a lower proportion of conversions. This also agrees with earlier bivariate plots and the regression analysis.

  • Media and referral features seem to contribute minimally to this analysis. While referral cases exhibit high conversion rates, their overall impact remains limited due to the small proportion of users who come through referrals.

Random Forest Classifier¶

In [30]:
# Fitting the random forest tree classifier on the training data
rf_estimator = RandomForestClassifier(random_state = 7, criterion = "entropy")

rf_estimator.fit(X_train,y_train)
Out[30]:
RandomForestClassifier(criterion='entropy', random_state=7)
In [46]:
# Checking performance on the training data
y_pred_train3 = rf_estimator.predict(X_train)

metrics_score(y_train, y_pred_train3)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2273
           1       1.00      1.00      1.00       955

    accuracy                           1.00      3228
   macro avg       1.00      1.00      1.00      3228
weighted avg       1.00      1.00      1.00      3228

[Confusion matrix heatmap]

Observations:

  • Similar to the decision tree, the random forest achieves perfect scores on the training data.
  • The model is most likely overfitting the training dataset, as we observed for the decision tree.

Let's confirm this by checking its performance on the testing data

In [31]:
# Checking performance on the testing data
y_pred_test3 = rf_estimator.predict(X_test)

metrics_score(y_test, y_pred_test3)
              precision    recall  f1-score   support

           0       0.87      0.92      0.89       962
           1       0.79      0.69      0.74       422

    accuracy                           0.85      1384
   macro avg       0.83      0.81      0.82      1384
weighted avg       0.85      0.85      0.85      1384

[Confusion matrix heatmap]

Observations:

  • Overfitting on the Training Set: The near-perfect training performance suggests the model is overfitted, capturing noise or patterns specific to the training data.

  • Weaker Generalization on the Test Set: The drop in performance, especially for Class 1, shows the model struggles to generalize to unseen data.

Random Forest Classifier - Hyperparameter Tuning¶

In [32]:
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)

# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
    "max_depth": [6, 7],
    "min_samples_leaf": [20, 25],
    "max_features": [0.8, 0.9],
    "max_samples": [0.9, 1],
    "class_weight" : ["balanced",{0: 0.3, 1: 0.7}]
             }
# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
rf_estimator_tuned_base = grid_obj.best_estimator_
In [33]:
# Fitting the best algorithm to the training data
rf_estimator_tuned_base.fit(X_train, y_train)
Out[33]:
RandomForestClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy',
                       max_depth=7, max_features=0.8, max_samples=0.9,
                       min_samples_leaf=20, n_estimators=110, random_state=7)
In [34]:
# Checking performance on the training data
y_pred_train4 = rf_estimator_tuned_base.predict(X_train)

metrics_score(y_train, y_pred_train4)
              precision    recall  f1-score   support

           0       0.94      0.86      0.90      2273
           1       0.72      0.86      0.78       955

    accuracy                           0.86      3228
   macro avg       0.83      0.86      0.84      3228
weighted avg       0.87      0.86      0.86      3228

[Confusion matrix heatmap]
In [35]:
# Checking performance on the test data
y_pred_test4 = rf_estimator_tuned_base.predict(X_test)

metrics_score(y_test, y_pred_test4)
              precision    recall  f1-score   support

           0       0.92      0.87      0.89       962
           1       0.73      0.83      0.78       422

    accuracy                           0.85      1384
   macro avg       0.82      0.85      0.83      1384
weighted avg       0.86      0.85      0.86      1384

[Confusion matrix heatmap]

Observations:

  • Performance Consistency: Training and test metrics are very close, with no significant drop in performance, indicating the tuned model is not overfitted.

  • Hyperparameter Tuning: There is potential to further refine the model by experimenting with additional hyperparameters or adjusting the current hyperparameter values to enhance performance.

  • Efficiency in Tuning: Acknowledging that GridSearchCV can be computationally intensive, the number of values passed to each hyperparameter has been intentionally reduced to balance runtime with optimization efforts.
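
The runtime concern is concrete: with six hyperparameters at two options each, the grid above already implies 2^6 = 64 candidates, and GridSearchCV fits each one cv times:

```python
# Number of fits implied by the grid above: product of option counts x cv folds
from math import prod

options_per_param = [2, 2, 2, 2, 2, 2]   # n_estimators, max_depth, min_samples_leaf,
                                         # max_features, max_samples, class_weight
n_candidates = prod(options_per_param)
n_fits = n_candidates * 5                # cv=5
print(n_candidates, n_fits)  # 64 320
```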

In [36]:
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(criterion = "entropy", random_state = 7)

# Grid of parameters to choose from
parameters = {"n_estimators": [110, 120],
    "max_depth": [6, 7],
    "min_samples_leaf": [20, 25],
    "max_features": [0.8, 0.9],
    "max_samples": [0.9, 1],
    "class_weight" : ["balanced",{0: 0.3, 1: 0.7}]
             }

# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, parameters, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
In [37]:
# Fitting the best algorithm to the training data
rf_estimator_tuned.fit(X_train, y_train)
Out[37]:
RandomForestClassifier(class_weight={0: 0.3, 1: 0.7}, criterion='entropy',
                       max_depth=7, max_features=0.8, max_samples=0.9,
                       min_samples_leaf=20, n_estimators=110, random_state=7)
In [38]:
# Checking performance on the training data
y_pred_train5 = rf_estimator_tuned.predict(X_train)

metrics_score(y_train, y_pred_train5)
              precision    recall  f1-score   support

           0       0.94      0.86      0.90      2273
           1       0.72      0.86      0.78       955

    accuracy                           0.86      3228
   macro avg       0.83      0.86      0.84      3228
weighted avg       0.87      0.86      0.86      3228

[Confusion matrix heatmap]
In [39]:
# Checking performance on the test data
y_pred_test5 = rf_estimator_tuned.predict(X_test)

metrics_score(y_test, y_pred_test5)
              precision    recall  f1-score   support

           0       0.92      0.87      0.89       962
           1       0.73      0.83      0.78       422

    accuracy                           0.85      1384
   macro avg       0.82      0.85      0.83      1384
weighted avg       0.86      0.85      0.86      1384

[Confusion matrix heatmap]

Observations:

Performance Summary

  • Accuracy: 86% on training and 85% on test, showing good generalization.

  • Class 0: High precision (94% train, 92% test) and recall (86%, 87%), with strong F1-scores (90%, 89%).

  • Class 1: Moderate precision (72%, 73%) but strong recall (86%, 83%); F1-score of 78% on both sets.

  • After hyperparameter tuning, the Random Forest model performs slightly better than the Decision Tree, achieving marginally higher overall accuracy and F1 scores for Class 1. However, the improvement is not substantial. If computational efficiency is a priority, the Decision Tree remains a valid choice due to its comparable performance. Additionally, decision trees are somewhat easier to interpret.
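
As a quick check on the Class 1 test numbers above, F1 is the harmonic mean of precision and recall:

```python
# F1 = 2PR / (P + R), using the Class 1 test precision/recall reported above
precision, recall = 0.73, 0.83
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.78
```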

In [40]:
importances = rf_estimator_tuned.feature_importances_

indices = np.argsort(importances)

feature_names = list(X.columns)

plt.figure(figsize = (12, 12))

plt.title('Feature Importances')

plt.barh(range(len(indices)), importances[indices], color = 'violet', align = 'center')

plt.yticks(range(len(indices)), [feature_names[i] for i in indices])

plt.xlabel('Relative Importance')

plt.show()
[Feature importance bar chart]

Observations:

  • The most important features appear to be time_spent_on_website, first_interaction_Website, and profile_completed_High, followed by profile_completed_Medium. This model places slightly more emphasis on time_spent_on_website relative to first_interaction_Website than the decision tree does, but the top three features remain consistent.
  • Many other features, including the media-related and referral features, contribute little to nothing in importance.
  • While referrals convert at a high rate, there were very few referrals compared to non-referrals; the channel is effective but has limited reach.
  • The tuned Random Forest achieved the highest accuracy. The Decision Tree was almost as good, with slightly better interpretability, and the models broadly agree.

XGBoost Classifier¶

In [41]:
from xgboost import XGBClassifier

# Choose the type of classifier
xgb_estimator_tuned = XGBClassifier(eval_metric='logloss', random_state=7)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [110, 120],
    "max_depth": [6, 7],
    "learning_rate": [0.1, 0.2],
    "subsample": [0.8, 0.9],
    "colsample_bytree": [0.8, 0.9],
    "scale_pos_weight": [1, 3],  # Adjust for class imbalance
}

# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label=1)

# Run the grid search
grid_obj = GridSearchCV(xgb_estimator_tuned, parameters, scoring=scorer, cv=5)

grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
xgb_estimator_tuned = grid_obj.best_estimator_
In [42]:
# Predictions and Metrics for XGBoost
y_pred_train_xgb = xgb_estimator_tuned.predict(X_train)
y_pred_test_xgb = xgb_estimator_tuned.predict(X_test)

print("XGBoost - Training Results:")
metrics_score(y_train, y_pred_train_xgb)
print("XGBoost - Test Results:")
metrics_score(y_test, y_pred_test_xgb)
XGBoost - Training Results:
              precision    recall  f1-score   support

           0       0.97      0.98      0.97      2273
           1       0.94      0.92      0.93       955

    accuracy                           0.96      3228
   macro avg       0.95      0.95      0.95      3228
weighted avg       0.96      0.96      0.96      3228

[Confusion matrix heatmap]
XGBoost - Test Results:
              precision    recall  f1-score   support

           0       0.88      0.92      0.90       962
           1       0.79      0.71      0.75       422

    accuracy                           0.86      1384
   macro avg       0.84      0.81      0.82      1384
weighted avg       0.85      0.86      0.85      1384

[Confusion matrix heatmap]

Observations:

The XGBoost model may be overfitting, as the training results are noticeably better than the test results. We apply regularization techniques next, just as we did for the Decision Tree and Random Forest.

In [47]:
# Choose the type of classifier
xgb_estimator_tuned = XGBClassifier(eval_metric='logloss', random_state=7)

# Grid of parameters to choose from
parameters = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
    "scale_pos_weight": [1, 3],  # Adjust for class imbalance
    "reg_alpha": [0, 0.1, 1],  # L1 regularization
    "reg_lambda": [1, 1.5, 2]  # L2 regularization
}

# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = metrics.make_scorer(f1_score, pos_label=1)

# Run the randomized search
random_search = RandomizedSearchCV(xgb_estimator_tuned, parameters, scoring=scorer, cv=5, n_iter=50, n_jobs=-1)  # n_jobs=-1 uses all available CPU cores
random_search = random_search.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
xgb_estimator_tuned = random_search.best_estimator_

# Predictions and Metrics for XGBoost
y_pred_train_xgb = xgb_estimator_tuned.predict(X_train)
y_pred_test_xgb = xgb_estimator_tuned.predict(X_test)

print("XGBoost - Training Results:")
metrics_score(y_train, y_pred_train_xgb)
print("XGBoost - Test Results:")
metrics_score(y_test, y_pred_test_xgb)
XGBoost - Training Results:
              precision    recall  f1-score   support

           0       0.90      0.94      0.92      2273
           1       0.84      0.75      0.80       955

    accuracy                           0.89      3228
   macro avg       0.87      0.85      0.86      3228
weighted avg       0.88      0.89      0.88      3228

[Confusion matrix heatmap]
XGBoost - Test Results:
              precision    recall  f1-score   support

           0       0.88      0.93      0.91       962
           1       0.82      0.71      0.76       422

    accuracy                           0.86      1384
   macro avg       0.85      0.82      0.83      1384
weighted avg       0.86      0.86      0.86      1384

[Confusion matrix heatmap]
In [48]:
# SHAP values for feature importance
explainer = shap.Explainer(xgb_estimator_tuned)
shap_values = explainer(X_train)

# Plot the SHAP summary plot for feature importance
shap.summary_plot(shap_values, X_train)
[SHAP summary plot]

Observations:

  • XGBoost and Random Forest both identify the same top three features as the most important. However, XGBoost places Professional in the fourth position, while Random Forest ranks it lower. Despite this difference, the overall insights from both models remain consistent.
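
The summary plot ranks features by mean absolute SHAP value across samples. A minimal numpy sketch of that aggregation (the array below is hypothetical, standing in for explainer(X_train).values):

```python
import numpy as np

# Hypothetical per-sample SHAP values (rows = samples, columns = features)
shap_vals = np.array([[ 0.2, -1.1, 0.05],
                      [-0.3,  0.9, 0.10],
                      [ 0.1, -1.4, 0.02]])
feature_names = ["age", "time_spent_on_website", "website_visits"]

importance = np.abs(shap_vals).mean(axis=0)          # global importance per feature
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(ranking)
```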

Random Forest vs. XGBoost:

  • Accuracy: Both models have similar accuracy on the test set (0.85 for Random Forest and 0.86 for XGBoost).
  • Class 0 Performance: XGBoost has slightly higher recall and F1-score for Class 0 on the test set.
  • Class 1 Performance: Random Forest has higher recall for Class 1 on the test set, while XGBoost has higher precision.
  • Overfitting: XGBoost shows a larger gap between training and test performance, indicating potential overfitting. Regularization techniques applied to XGBoost (like reg_alpha and reg_lambda) help mitigate this but might need further tuning.
  • Random Forest: Easier to interpret and performs well with balanced precision and recall.
  • XGBoost: Slightly better overall performance but may require careful tuning to avoid overfitting.

Final Takeaways¶

Model Summary

  • Each iteration, from Logistic Regression through tuned Decision Trees and Random Forest to tuned XGBoost, demonstrated incremental improvements in overall performance metrics.

  • While the Decision Tree model is slightly less effective, it still performs well as a standalone option, especially when computational power is a limiting factor.

  • Logistic Regression excels in metrics for Class 0 but struggles with recall for Class 1, making it less suited for imbalanced datasets. It also suffers from multicollinearity issues, which tree-based models like Decision Trees, Random Forest, and XGBoost handle more effectively.

  • The models showed consistency in identifying top, medium, and insignificant contributors. Although the ranking of features varied slightly, the results remained largely consistent.

Business Insights

  • Mobile App Experience: Users who start their journey with the mobile app show lower conversion rates. This indicates a need to improve the app experience, as the website currently provides a more effective pathway for user engagement and conversion.

  • Digital Media Ads: With no noticeable impact on conversions, the budget allocated to digital ads could be redirected towards enhancing the mobile app experience, which holds greater potential for improving results.

  • User Engagement and Profile Completion: High and medium profile completion rates are critical factors for conversions. Marketing strategies should target users with at least medium-level profile completion, and personalized reminder emails can be sent to encourage sign-ups.

  • Website Optimization: Website-related factors consistently rank as the most influential drivers of conversion. Enhancing website functionality and engagement features should remain a top priority, focusing on users who are highly active on the site.

  • Referrals: Although referrals were not identified as significant by the models due to the limited number of cases, their high conversion rates are evident in bivariate analysis. Introducing referral bonus programs could increase the volume of referrals, amplifying their impact in future models.

  • Student Segment: Students are the least likely to enroll, possibly due to existing school options that reduce the need for EdTech solutions. Introducing tiered pricing could make the offerings more appealing and increase demand within this segment.